A Hybrid of Random Forests and Generalized Path Analysis: A Causal Modeling of Crashes in 52,524 Suburban Areas

Background: Determining suburban area crashes’ risk factors may allow for early and operative safety measures to find the main risk factors and moderating effects of crashes. Therefore, this paper has focused on a causal modeling framework. Study Design: A cross-sectional study. Methods: In this study, 52,524 suburban crashes were investigated from 2015 to 2016. The hybrid-random-forest-generalized-path-analysis technique (HRF-gPath) was used to extract the main variables and identify mediators and moderators. Results: This study analyzed 42 explanatory variables using a RF model, and it was found that collision type, distinct, driver misconduct, speed, license, prior cause, plaque description, vehicle maneuver, vehicle type, lighting, passenger presence, seatbelt use, and land use were significant factors. Further analysis using g-Path demonstrated the mediating and predicting roles of collision type, vehicle type, seatbelt use, and driver misconduct. The modified model fitted the data well, with statistical significance (χ²(30) = 81.29, P < 0.001) and high values for comparative-fit-index and Tucker-Lewis-index exceeding 0.9, as well as a low root-mean-square-error-of-approximation of 0.031 (90% confidence interval: 0.030-0.032). Conclusion: The results of our study identified several significant variables, including collision type, vehicle type, seatbelt use, and driver misconduct, which played mediating and predicting roles. These findings provide valuable insights into the complex factors that contribute to collisions via a theoretical framework and can inform efforts to reduce their occurrence in the future.

the relationships between independent and dependent variables. If the assumptions are violated, biased estimations and improper inferences can be obtained. 8 Machine learning techniques, as applied statistical methods, have been widely used in data analysis. These techniques do not assume pre-defined relationships between study variables, and prediction is possible without needing to understand the underlying mechanisms. These methods are currently successful due to the growth of computational power. 6,9 Additionally, even though large population studies are routinely used to estimate the effect of predictors in actual situations, they are subject to confounding bias due to the lack of randomization. Hence, methods from the causal inference framework could be investigated as a strategy for developing sound and relevant science. Moreover, there is always difficulty with the number of variables that must be entered into the conceptual diagram of causal modeling, particularly in traffic studies with many risk factors. First, relying solely on substantive knowledge makes it challenging to detect true confounders. Second, neglecting a true confounder could result in biased conclusions, while accounting for non-confounders could raise variance. 10,11 Based on the literature in various disciplines, random forests (RF) as a machine learning technique and path analysis as a causal approach have been shown to perform well for road traffic crash injury severity prediction. 12,13 The RF has proven to be a reliable, robust, and efficient algorithm for feature selection, even if the number of features is high. Furthermore, it outperforms other black-box algorithms as it is trained by a bootstrap aggregating (bagging) algorithm. This not only enhances the stability and accuracy of individual trees but also reduces variance and prevents over-fitting.
The RF is also known for its interpretability, as its trees can be read as a set of if-then rules. 14,15 Path analysis is a useful statistical tool for investigating the causal relationships between variables. It combines bivariate and multivariable linear regression to examine the causal relations among the variables in the model. 16 This method can accurately determine the influence and significance of the relationship between various variables. 17 In this paper, a hybrid random forest generalized path analysis (HRF-gPath) method was proposed to retain a sufficient number of informative variables in the causal model of suburban area crashes. Beyond its methodological novelty, combining these methods would lead to optimal feature selection and provide a powerful causal approach for drawing better conclusions. The results of this study can inform guidelines and help specialists decide on the crucial risk factors of traffic crashes in suburban areas based on scientific evidence.

Study design
This cross-sectional study analyzed the information on suburban crashes recorded in the Integrated Road Traffic Injury Registry System (IRTIRS). 18

Ethics approval and consent to participate
The study was conducted following the Declaration of Helsinki and approved by the Institutional Review Board (#1396.465) and the Ethics Committee (#IR.TBZMED.REC.1398.1244) of Tabriz University of Medical Sciences, Iran. Participation in the study was voluntary for everyone, and participants' privacy was respected. The participants were assured that their personal information would remain confidential and not be disclosed. Informed consent was obtained from both the adult participants and the parent(s)/guardian(s) of all under-16s; furthermore, informed consent was obtained from legal guardians or next of kin for illiterate participants. All methods were performed following the relevant guidelines and regulations. Finally, informed consent was obtained from all individual participants included in the study.

Data collection and study variables
Crash scene-, vehicle-, and driver-based information was collected in the most critical provinces in Iran, which are either capital city destinations, tourism destinations, or free zone areas. Crash-based information included passenger presence, pedestrian presence, crash day, crash type, time, lighting status, weather, zone type, intersection control, line marking, road material, land use, crash mechanism, view obstacle, and crash position. Other crash-related information was road surface, geometric design, vehicle factor, human factor, cause of the accident, collision type, distinct, road shoulder, road defect, permitted speed, and road repairing status. Moreover, vehicle-based information contained vehicle safety equipment, type, color, life, maneuver, plaque description, and moving direction. Finally, driver-based information included age, gender, education, job, driving license, seat belt usage, judiciary cause, and misconduct. This study divided the district into three categories, including tourist destinations, capital destinations, and free zones. Crash severity was recorded in three categories: property damage, injury, and fatality. Based on the study purpose, severity data were recoded into two categories: (1) damage or injury as a non-fatal crash (Y = 0) and (2) fatality as a fatal crash (Y = 1). There were 2,399 (4.57%) fatal crashes out of 52,524 suburban crashes. Overall, the information related to 42 explanatory variables was recorded, the details of which are presented in Table 1.
The proposed hybrid model starts with the RF classifier for variable selection, followed by generalized path analysis to conduct causal modeling. In the first step of the proposed HRF-gPath model, the RF classifier efficiently removes less important variables and enhances the proposed model's generalization capabilities.
The RF is a supervised machine learning technique introduced by Breiman 19 that builds on the decision tree approach implemented in the classification and regression tree methodology. A decision tree classifies data by dividing them into groups based on the value of a particular variable. It then repeats this division such that each data group comprises cases in the same category of the objective variable. In this method, the basis of most decisions is classification. In addition, the importance of each variable and its contribution to data classification can be determined from the created decision trees. This study used classification algorithms to predict a categorical dependent variable. The risk was calculated as the proportion of cases incorrectly classified by the trees. The Gini index (GI) was employed to reduce the node impurity. Our optimal model was trained to have a GI around 0.1. To control the key aspects of the estimation procedure and model parameters, including the complexity of the trees fitted to the data, the maximum number of trees in the forest was set to 100. Additionally, as a stopping criterion, the maximum number of leaves was set to 10. 19 The data were randomly split into training and test sets so that the training set consisted of 80% of the full data set, while the test set comprised the remaining 20%. The training set was utilized to fit (train) the model. The test set was used to evaluate the performance of the fitted RF and determine whether it was overfitting. The research team took the mid-point of 0.5 as the cutoff for the feature selection criterion; variables meeting it were introduced to the gPath analysis.
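The building blocks named above (Gini impurity, the 80/20 split, and the 0.5 importance cutoff) can be illustrated with a minimal sketch in plain Python. It is not the software used in this study: the function names are hypothetical, and rescaling importances so that the top variable equals 1.0 before applying the cutoff is an assumption of the sketch, not something the text states.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2 over the class shares p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out `test_fraction` of them."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def select_features(importances, cutoff=0.5):
    """Keep features whose rescaled importance reaches the cutoff.

    Assumption of this sketch: importances are divided by the largest
    value, so the top feature scores 1.0 and the cutoff lies in [0, 1].
    """
    top = max(importances.values())
    scaled = {name: value / top for name, value in importances.items()}
    kept = [name for name, value in scaled.items() if value >= cutoff]
    return sorted(kept, key=lambda name: scaled[name], reverse=True)
```

For example, a node holding nine non-fatal and one fatal crash has a Gini impurity of 1 - (0.9² + 0.1²) = 0.18, and a pure node has an impurity of 0.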

Statistical analysis
To maximize the advantages of the algorithm in this hybrid approach and to bring it into the causal framework, the output data from the RF classifier with the selected variables were then presented to the gPath to fit a causal model to the data. There were six steps in path modeling, including model specification, model identification, model estimation, model testing, model modification, and model validation. 21,22 The root mean square error of approximation (RMSEA) was the next measure of goodness-of-fit, with values below 0.05 being considered a good fit and values up to 0.08 representing acceptable errors in the population. 20 For an inadequate model, model modification involves adjusting an identified and estimated model through the modification indices provided by the model. In this study, the bootstrap method was utilized for model validation.
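The idea behind bootstrap validation can be sketched as a percentile confidence interval: resample the data with replacement, recompute the statistic each time, and read the interval off the sorted estimates. The sketch below is a generic illustration under that assumption; the function name, the number of resamples, and the use of a simple mean as the statistic are hypothetical choices, and the exact bootstrap procedure used in this study is not described in the text.

```python
import random
import statistics


def bootstrap_ci(sample, statistic, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic`."""
    rng = random.Random(seed)
    n = len(sample)
    # Recompute the statistic on resamples drawn with replacement.
    estimates = sorted(
        statistic([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy binary outcome; any function of a resample can be the statistic,
# for example a refitted path coefficient.
toy_outcome = [0, 1, 0, 1, 1, 0, 1, 0]
ci_low, ci_high = bootstrap_ci(toy_outcome, statistics.fmean, n_resamples=200)
```

Comparing such an interval with the model-derived interval for the same coefficient is one way to judge their overlap.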

Results
From March 2015 to March 2016, IRTIRS registered 384,614 traffic crashes. Suburban area crashes accounted for 52,524 (13.66%) of them. The fatality rate among suburban crashes was 4.6% (2,399 cases). Table 1 provides details about the frequency distribution of crash scenes, vehicles, and driver-related variables describing the crashes.

Results of the random forests model
The results of RF feature selection demonstrated that 12 variables, namely, collision type, distinct, driver misconduct, permitted speed, driver's license, plaque description, vehicle maneuver, vehicle type, lighting status, passenger presence, driver seat belt, and land use, were selected as significant variables. The risk estimate and its standard error were 0.046 and 0.001, respectively, for both the training and test sets. Figure 1 summarizes the results of the RF model in more detail.
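Since the risk is defined as the proportion of incorrectly classified cases, it can be computed directly from true and predicted labels. The sketch below does so and attaches a binomial standard error, which is an assumption about how the standard error is obtained. As a reference point only, it applies the function to a hypothetical trivial rule that labels every crash non-fatal, using the counts reported in this paper; this rule is not the fitted RF.

```python
import math


def misclassification_risk(y_true, y_pred):
    """Return (risk, standard error): share of wrong labels and its binomial SE."""
    n = len(y_true)
    wrong = sum(1 for truth, guess in zip(y_true, y_pred) if truth != guess)
    risk = wrong / n
    return risk, math.sqrt(risk * (1.0 - risk) / n)


# Counts reported in this paper: 2,399 fatal crashes out of 52,524.
y_true = [1] * 2399 + [0] * (52524 - 2399)
# Hypothetical trivial rule (not the fitted RF): label every crash non-fatal.
y_all_nonfatal = [0] * 52524
baseline_risk, baseline_se = misclassification_risk(y_true, y_all_nonfatal)
```

With a fatal share of about 4.6%, such a reference value helps put a reported risk estimate into context on imbalanced data.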

Results of the hybrid RF-gPath model
Although the RF method was used to select variables, multicollinearity among the selected variables remained possible, so we checked the correlations between the independent variables to ensure they were not highly correlated. Figure 2 shows a correlation matrix for all the variables introduced to the causal model. The color coding represents how correlated two variables are, with dark blue and dark red squares representing a strong positive correlation (+0.7 to +1) and a strong negative correlation (-1 to -0.7), respectively. 23 According to the figure, the correlations between variables are not strong enough to indicate substantial collinearity or multicollinearity. 24 A conceptual model of variables extracted from the RF model (Figure 3a) was constructed to answer the research question. Figure 3b illustrates the modified model, where the values on the arrows represent standardized regression coefficients from one variable to another, which are the direct effects. The modified model fitted the data reasonably well, with χ²(30) = 81.29, P < 0.001, χ²/df = 2.71 < 5, CFI = 0.97 > 0.9, TLI = 0.95 > 0.9, and RMSEA = 0.031 < 0.08 (90% confidence interval [CI]: 0.030 to 0.032). Table 2 provides direct, indirect, and total effects ending in the outcome. Bootstrap validation supported the model, showing an acceptable overlap between the bootstrap confidence intervals and the model-derived confidence intervals, with negligible biases.
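The decomposition into direct, indirect, and total effects can be illustrated with a short sketch. It assumes the usual product-of-coefficients rule for linear paths: an indirect effect is the product of the coefficients along a mediated path, and the total effect is the sum over all paths. The variable names and coefficients below are hypothetical and are not taken from Table 2; with a binary outcome, the scale of the coefficients also depends on the link function, which the sketch ignores.

```python
def total_effect(edges, source, target):
    """Sum, over every directed path from source to target, of the product
    of the coefficients along that path (the graph must be acyclic)."""
    if source == target:
        return 1.0
    return sum(
        coef * total_effect(edges, child, target)
        for (parent, child), coef in edges.items()
        if parent == source
    )


def decompose(edges, source, target):
    """Split the effect of source on target into direct and indirect parts."""
    direct = edges.get((source, target), 0.0)
    total = total_effect(edges, source, target)
    return {"direct": direct, "indirect": total - direct, "total": total}


# Hypothetical standardized coefficients, for illustration only.
edges = {
    ("maneuver", "collision"): 0.5,
    ("collision", "fatality"): 0.4,
    ("maneuver", "fatality"): 0.1,
}
effects = decompose(edges, "maneuver", "fatality")
```

In this toy model the indirect effect through the mediator is 0.5 × 0.4 = 0.2, so the total effect is 0.1 + 0.2 = 0.3, up to floating-point rounding.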

Indirect effects
All coefficients on the perfect fitted model were statistically significant at the 0.05 level of significance, except for the path from the vehicle plaque description and land use toward fatality, as well as the path from permitted speed

Discussion
This is the first study to explore the applicability of the innovative HRF-gPath model for detecting causal relationships and predicting fatality in suburban crashes. The proposed novel HRF-gPath chose a reasonable number of features and showed their direct and indirect relationships.
Interestingly, the associations of vehicle maneuver, presence of a passenger, lighting status, and driver misconduct with fatality were mediated by collision type. Moreover, distinct, driver's license, and plaque description affected the vehicle type and, consequently, fatality, which is consistent with the findings of a previous study. 25 The relationship between vehicle type and fatality was mediated by seat belt use. Furthermore, driver misconduct played a mediator role in the relationship between fatality and variables such as vehicle maneuver, driver's license, presence of a passenger, lighting status, and vehicle type. Collision type, vehicle type, seat belt use, and driver misconduct demonstrated a significant relationship with fatality. Therefore, this explored model could be considered a practical theoretical framework to explain how collision type, vehicle type, seat belt use, and driver misconduct can predict and mediate fatality in suburban crashes. Further studies can modify and establish this model.
Based on the results of the present study, vehicle maneuver, presence of a passenger, lighting status, and driver misconduct could be considered significant predictors of collision type. The significant relation between vehicle maneuvers and collision type indicates that different vehicle maneuvers would lead to different collision types. Overtaking while driving, as the main cause of head-on collisions with serious consequences, can be a salient example of this relationship, as reported in other studies. 26,27 Consistent with the results of international research, the presence of a passenger may reduce attention to the driving task and exert direct or indirect psychological pressure to drive less safely. In the same vein, it can be assumed that the presence of a passenger may lead to increased stress and thus reduced driving performance. 28 However, we cannot make any assumptions about the risky role of passenger presence, which is similar to the finding of the study conducted by Orsi et al. They concluded that young drivers carrying passengers were particularly vulnerable in single-vehicle collisions; yet, for adult drivers, this collision was more harmful if the driver was alone in the vehicle, 16 which is in line with the results regarding the relationship between lighting status and collision type. Studies assessing rear-end crash exposure methodology revealed that many rear-end collisions were attributed to daytime. 29 Studies have reported driver misconduct as a predictor of collision type. Goel and Sachdeva studied the reasons for collisions, the type of collision, and the type of vehicle involved. They found that head-on or rear-end collisions are mainly due to driver misconduct. 30 Considering the division of distinct (tourism destination, capital city destination, and free zone), the relationship between the distinct and the vehicle type is quite clear.
Based on the results of this study, the distribution of heavy vehicles in the capital destination has a different pattern than in a tourist destination and the free zone. Tehran, the capital of Iran, is the economic center of the country, with more than 45% of large industrial factories. 31 Therefore, these factories increase the use of heavy vehicles for road freight transport. Similar studies showed that freight vehicles are heavier than passenger vehicles and thus carry more kinetic energy in accidents. In addition, capital cities usually have limited freight infrastructure, including loading space, road space, and parking, to accommodate the increasing freight traffic. These limitations further challenge the safe and efficient operation of heavy vehicles. 32 According to the results of similar studies in Iran, the vehicle type itself affects whether drivers decide to use seat belts. For example, sport utility vehicle and van drivers are less likely to use seat belts. 33,34 Among all variables, the presence of a passenger was the strongest predictor of driver misconduct. Talking to the passenger has been identified as a distractor and a predictor of driver misconduct. 35 It has been concluded that professional drivers have a lower probability of risky driving behaviors. However, this is in contrast with the findings of a study by Mekonnen et al, indicating that driver misconduct is common among professional drivers. 36 As a third significant predictor of driver misconduct, vehicle maneuver plays a crucial role. Based on the findings of similar studies, the likelihood of misconduct increases by 2.98 and 2.15 times for drivers who engage in overspeeding and those who frequently make dangerous overtakes, respectively. 37 Lighting status is the other significant predictor of driver misconduct. There is solid evidence from some studies that driving in dim light makes it harder to prevent crashes.
As the number of miles traveled at night is significantly lower than during the day, drivers are more likely to drive faster during the daytime than at night. 37,38 In terms of the relationship between driver misconduct and vehicle type, it is believed that, as key participants in the goods industry, drivers of heavy vehicles are one of the main factors in traffic safety. In a study of traffic collisions involving heavy vehicles, 90% were reported to be the result of driver misconduct. 39
As the first limitation, there is no precise and detailed registry system in the country to combine this information with hospital information. As a result, only information on death at the scene is available, and therefore the results cannot be generalized to cases of death in the hospital. Another problem of this study is that accidents are probably not fully reported to the authorities. The main limitation is that the analysis covers only 2015 to 2016; access to data from 2016 to 2021, which would have enlarged and improved this research, was restricted. Like most classification problems, this study is limited by its imbalanced data. Although balancing the data before fitting a random forest model can improve model performance and yield more accurate evaluation metrics, it may cause information loss, increase time and computational cost, and create a mismatch with the real-world class imbalance. Hence, experimenting with both balanced and imbalanced datasets to assess the impact on model performance and choose the approach that best aligns with the problem is recommended for further studies.
On the other hand, this study was based on information from six densely populated provinces of the country, which can be considered its first strength and makes the results more generalizable. This study also introduced a hybrid approach for analyzing traffic crash data to develop a parsimonious model for suburban area crashes, which can be another strength.

Conclusion
The proposed novel HRF-gPath model helped us identify reasoned pathways of fatal crashes in suburban areas. When exogenous and mediator variables are modeled together, all may predict fatality. As mediator variables, collision type, vehicle type, seat belt use, and driver misconduct originate from risk factors underlying this predicament. It is suggested that further research explore the unseen biases of the issue. Healthcare providers, police, and psychologists should consider the dominance of the mediators explored in this study while designing prevention programs for suburban area crashes.